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DISTRIBUTED  COMPUTER  SYSTEMS:  STRUCTURE  AND  SEMANTICS 


Abstract 

Tnis  report  describes  on  ongoing  project  in  the  area  of  design  of 
distributed  systems.  The  goal  is  to  develop  an  effective  programming  system 
tnat  will  support  well-structured  design,  implementation,  maintenance  and 
control  of  distributed  processing  applications.  This  programming  system 
combines  a powerful  high  level  language  and  operating  system  features,  and 
addresses  tne  underlying  system  problems  that  affect  the  reliability  and 
security  perceived  on  the  application  level.  The  report  presents  a conceptual 
nodel  of  distributed  computation,  and,  in  the  context  of  this  model,  discusses 
our  approaches  to  inter-node  communication  and  cooperation,  reliability,  and 
protection.  One  of  tne  basic  goals  of  our  project  is  to  allow  the  application 
programmer  to  work  with  application-oriented  entities.  Thus,  inter-node 
messages,  error  handling  and  protection  constraints  will  all  be  expressible. iri 
application  oriented  terms.  The  report  concludes  with  some  examples  of  the 
language  constructs  and  an  outline  of  the  future  research  under  this  project. 
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DISTRIBUTED  COMPUTER  SYSTEMS:  STRUCTURE  AND  SEMANTICS 

Abstract 

This  report  describes  on  ongoing  project  in  the  area  of  design  of 
distributed  systems.  The  goal  is  to  develop  an  effective  programming  system 
tnat  will  support  well-structured  design,  implementation,  maintenance  and 
control  of  distributed  processing  applications.  This  programming  system 
combines  a powerful  high  level  language  and  operating  system  features,  and 
addresses  the  underlying  system  problems  that  affect  the  reliability  and 
security  perceived  on  the  application  level.  The  report  presents  a conceptual 
model  of  distributed  computation,  and,  in  the  context  of  this  model,  discusses 
our  approaches  to  inter-node  communication  and  cooperation,  reliability,  and 
protection.  One  of  tne  basic  goals  of  our  project  is  to  allow  the  application 
programmer  to  work,  with  application-oriented  entities.  Thus,  inter-node 
messages,  error  handling  and  protection  constraints  will  all  be  expressible  in 
application  oriented  terms.  The  report  concludes  with  some  examples  of  the 
language  constructs  and  an  outline  of  the  future  research  under  this  project. 


1.  INTRODUCTION 


Computer  systems  shouLd  reflect  the  structure  and  needs  of  the  problem  to 
wnich  they  are  being  applied.  For  many  applications,  a distributed  computer 
system  represents  a natural  realization.  For  both  technical  and  economic 
reasons,  it  is  Likely  that  for  many  existing  applications,  distributed 
computer  systems  will  replace  conventional  computer  systems  built  around  a 
Large  central  processor,  and  that  new  applications  will  emerge  based  on 
distributed  information  processing. 

The  area  of  "distributed  systems"  has  become  a popular  source  of  systems 
research  projects.  This  trend  has  been  supported  mainly  by  the  rapidly 
falling  cost  of  computing  hardware  and  the  increasing  power  and  flexibility  of 
nini  3nd  microcomputers.  However,  many  research  efforts  in  this  area  seem  to 
miss  the  most  important  aspect  of  the  revolution  in  the  hardware  costs  and 
power:  the  steadily  decreasing  entry  cost  of  acquiring  and  operating  a 

f ree-stand  ing,  complete  computer  system  encourages  lower-level  units  within  a 
large  organization  to  acquire  their  own  computers  that  consequently  will 
operate  somewbat  independently  and  autonomously  from  one  another.  The 
administrative  autonomy  is  really  the  driving  force  that  leads  to  acquisition 
of  locaL  computers  dedicated  to  the  applications  of  a particular  organization 
unit.  However,  it  is  necessary  to  anticipate  that  these  autonomous  computer 
systems  will  have  to  be  at  least  loosely  coupled  into  a cooperating 
confederacy  that  serves  as  the  information  system  of  the  organization. 

The  basic  technical  problem  in  chese  emerging  systems  is  to  provide 
coherence  in  communication  among  the  nodes  in  a computer  network  while  these 
nodes  retain  their  administrative  autonomy.  Technically,  autonomy  appears  as 
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a force  producing  incoherence:  one  must  assume  that  operating  schedules. 
Loading  policy.  Level  of  concern  for  security,  availability,  and  reliability, 
update  level  of  hardware  and  software,  and  even  choice  of  hardware  and 
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software  systems  will  tend  to  vary  from  node  to  node  with  a minimum  of  central 
control.  Further,  individual  nodes  may  for  various  reasons  occasionally 
completely  disconnect  themselves  from  the  confederacy,  and  operate  in 
isolation  for  a while  oefore  reconnecting.  Yet  to  the  extent  that  agreement 
and  cooperation  are  beneficial,  there  will  be  a need  for  communication  of 
signals,  exchange  of  data,  mutual  assistance  agreements,  and  a wide  variety  of 
otner  internode  interaction.  We  hypothesize  that  one-at-a-time  ad  hoc 
arrangements  will  be  inadequate,  because  of  their  potential  large  number  and 
tne  programming  cost  in  dealing  with  each  node  on  a different  basis. 

The  move  toward  distributed  systems  will  be  dictated  not  just  by  their 
"naturalness",  but  also  by  the  many  technical  advantages  they  offer  over 
centralized  systems.  These  advantages  include  the  following: 

Availability.  Availability  of  information  can  be  increased  by 
replicating  it  at  several  nodes.  This  arrangement  not  only  increases  the 
access  bandwidth  to  the  information,  but  in  case  of  a failure  of  one  of 
the  nodes  or  some  communication  link,  the  information  remains  accessible. 
Protection.  Distributed  systems  provide  a better  environment  for 
protecting  information  stored  in  the  system  and  for  coping  with  run-time 
errors  resulting  from  hardware  failures  or  residual  design  and 
implementation  errors.  These  advantages  arise  from  the  actual  physical 
separation  of  independent  or  loosely  coupled  computations  and  information 
tnat  belongs  to  different  users.  The  physical  boundaries  of  individual 
nodes  provide  "firewalls"  that  (if  properly  designed)  will  prevent 
spreading  of  errors  originating  in  a particular  node  to  the  rest  of  the 
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system  and  protect  information  stored  at  individual  nodes  from 
unauthorized  access  or  modification  by  other  nodes.  As  the  most  severe 
protection  measure,  a seif-cu..tained  node  can  be  guaranteed  privacy 
during  some  sensitive  operation  by  physicaLly  detaching  it  from  the  rest 
of  tne  system. 

Hxpandab i l i ty . As  more  users  join  the  system  or  new  services  are  added, 
it  is  not  necessary  to  make  any  physical  replacements  in  a distributed 
system.  Rather,  one  or  more  new  nodes  need  to  be  added  to  the  system;  if 
the  system  is  designed  properly,  it  may  be  possible  to  accomplish  this 
without  interrupting  the  service  of  the  existing  system.  Thus, 
distributed  systems  offer  a potential  for  a more  gradual  and  smoother 
growth  than  systems  with  a large  central  processor. 

Thus,  there  are  many  sound  reasons  why  applications  should  be  implemented 
as  distrihuted  systems.  However,  while  it  has  been  successfully  demonstrated 
that  it  is  not  very  difficult  to  interconnect  remote  computers  at  the 
electrical  and  information  bit  level,  the  effective  utilization  of  such  a 
network  at  a higher  software  and  applications  level  is  still  missing.  The 
project  discussed  in  this  report  is  aimed  at  solving  the  technical  problems 
that  hinder  the  development  of  applications  for  distributed  systems.  In 
particular,  the  goal  of  this  project  is  to  develop  an  integrated  programming 
language  and  operating  system  to  support  well-structured  design  and 
implementation  of  distributed  applications. 

1.1  Distributed  Systems  Of  Interest 

The  distributed  systems  considered  in  our  project  can  be  described 
loosely  as  organizations  of  highly  autonomous  information  processing  modules, 
called  nodes , which  cooperate  in  a manner  that  produces  an  image  of  a coherent 
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system  of  a certain  defined  Level.  Autonomy  is  the  key  characteristic  that 
eliminates  most  muit iproc.ssor  organizations  from  this  class  of  distributed 


systems.  Certainly,  a distributed  system  has  more  than  one  processor,  since 
it  has  at  least  one  processor  in  each  node.  However,  in  a distributed  system, 
tne  nodes  are  highly  independent,  each  having  its  own  primary  memory,  possibly 
some  secondary  storage,  and  its  own  interface  through  which  it  communicates 
with  its  environment  (e.g.  user  terminals,  sensors) . The  individual  nodes  are 
connected  by  a communication  network;  the  communication  delay  may  be  highly 
variaole  and  unpredictable.  Tne  communication  network  might  be  a long-haul 
network  such  as  tne  ARPANET  [R0BE7QI , a local  area  network  [CLAR73] , or  a 
suitaoLe  combination  of  these  two  types.  Each  node  has  access  to  its  own 
memory  only;  that  is,  inter-node  communication  is  possible  only  by  explicitly 
exchanging  messages,  not  tiirough  shared  memory.  Finally,  physical 
(geographical)  reorganization  of  the  nodes  and  the  communication  network  is 
assumed  not  to  impair  the  system's  functionality;  tne  only  change  might  be  in 
the  system's  performance. 


1.2  Comparison  Of  Our  Approach  Wien  Related  Work 

The  assumption  of  autonomy  of  the  nodes  that  compose  a distributed  system 
is  the  most  important  ingredient  that  distinguishes  our  work.  However,  once 
autonomy  is  assumed,  the  next  issue  chat  arises  is  to  devise  techniqes  that 
permit  programs  running  on  the  autonomous  nodes  to  communicate  in  a coherent 
fashion.  We  are  aiming  at  a high  leveL  of  coherence  that  is 
appl icat ion- independent  but  permits  communication  among  the  nodes  in 
appl icat ion-or iented  terms.  This  high  level  of  application-independent 
coherence  distinguishes  our  approach  from  other  work  that  is  based  on  the 
assumption  of  autonomous  nodes.  Most  work  has  either  provided  a very  low 
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level  ot  coherence  (e.g.  tne  ARPANET)  or  has  provided  coherence  within  a 
specific  application  (e.g.  the  NSW  works  manager  [MILL77] ) . There  is  some 
work  related  to  ours  in  progress  at  Xerox  PARC,  but  again  this  work  is 
focussing  on  a very  specific  application  — office  automation. 

The  probiem  of  simultaneous  update,  making  an  identical  or  a logically 
related  change  at  several  sites,  has  received  considerable  study  [GRAY78, 

MONT 78 , REED78 , ROTH77,  STEA76,  TAKA 7 8 , THOM76) . However,  we  remain 
unconvinced  that  a solution  to  this  particular  problem  is  crucial  to  our 
researcn.  Rather,  we  view  our  system  as  providing  an  environment  in  which  any 
one  of  several  simultaneous  update  algorithms  can  be  implemented  as  needed. 
This  point  distinguishes  our  work  from  SDD-1  [ROTH77],  for  example,  since  that 
project  assumes  a very  particular  technique  for  implementing  simultaneous 
update.  SDD-1  also  makes  very  restrictive  assumptions  about  the  autonomy  of 
tne  nodes  of  the  system. 

Distributed  systems  have  only  lately  become  a focus  of  programming 
language  research.  In  the  past,  programming  languages  have  mostly  not 
addressed  concurrent  programs.  More  recent  languages  (e.g.  Concurrent  Pascal 
[BRIN731,  Modula  [WlRF/7!)  have  had  features  for  concurrency,  but  within  the 
context  of  a single  processor:  these  languages  are  based  on  the  assumption 
that  programs  interact  through  shared  memory,  which  is  not  consistent  with  the 
concept  of  autonomous  nodes  with  private  memory.  There  is  related  work  at 
Oxford  [HOAR77],  the  University  of  Rochester  [FELD77I  and  at  MIT  [DENI75, 
HEW1761 , but  this  work  does  not  place  strong  emphasis  on  integrating  the 
language  and  operating  system  features. 

Indeed,  we  feel  that  our  emphasis  on  integration  of  language  and  system 
is  a further  key  factor  in  our  work  that  distinguishes  it  from  other  related 
work.  Much  of  what  distributed  programs  do  falls  into  what  is  usually 
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considered  to  be  the  systems  area,  including  such  topics  as  synchronization  of 
access  to  shared  information,  and  protection.  However,  programs  are  written 


in  a programming  language,  and  proper  primitives  in  that  language  can  greatly 
influence  the  structure  of  programs.  By  integrating  the  two  areas  we  expect 
to  achieve  a greater  impact  on  the  construction  of  distributed  systems  than 
could  be  accomplished  in  either  area  separately. 


» 


2.  STUDY  OF  APPLICATIONS 

It  is  essential  that  the  mechanisms  we  develop  to  support  construction  of 
distributed  applications  cover  the  real  distributed  processing  problems.  To 
this  end,  we  have  studied  a number  of  applications,  both  by  direct  observation 
[SV0B7SA,  SVOB78BI  and  by  surveying  related  work  as  discussed  earlier.  This 
study  was  hampered  by  the  lack  of  existing  distributed  systems;  for  example, 
banking  systems  are  not  yet  distributed,  although  a distributed  system  is 
being  planned.  Therefore,  we  had  to  supplement  our  study  by  sketching  designs 
for  future  systems. 

Several  different  classes  of  distributed  activities  have  been  identified: 
Invocat ion  of  remote  servers . A message  is  sent  to  a remote  node 
instructing  some  server  at  the  node  to  perform  a certain  operation;  a 
reply  (requested  information  or  an  acknowledgement  if  no  data  is  to  be 
returned)  confirms  that  the  operation  has  been  performed.  The  mail 
system  in  the  ARPANET  is  an  example  of  this  type  of  application. 

Atomic  Transact  ions  on  Distributed  Databases . To  preserve  the  integrity 
of  a database,  it  may  be  necessary  to  provide  a mechanism  that  guarantees 
tnat  either  all  updates  specified  by  a transaction  will  be  performed,  or 
none,  no  matter  how  the  transaction  fails. 
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Distributed  Data  Processing  if  the  Large  quantities  of  data  residing  at 
different  nodes  are  processed,  a probLera  may  arise  even  if  no  updates  are 
performed,  which  is  to  minimize  the  data  moved  between  nodes  in  order  to 
perform  the  desired  operation.  An  exampLe  is  query  processing  in  a 
distributed  database  system. 

D i s t r ibuted  Prob iem  Solving.  This  describes  systems  where  the  cost 
(overhead)  of  maintaining  a centralized  global  view  of  the  system  state 
and  control  is  proh  ib L t ive . In  such  systems,  each  node  knows  only  a 
p.irtial  state  of  the  system  and  has  to  make  intelligent  guesses,  from  the 
information  received  from  other  nodes,  about  the  rest  of  the  system.  An 
example  of  such  an  application  is  a dynamic  routing  algorithm  for 
store-and- for ward  networks. 

Distributed  Programming  System.  Tnis  is  a distributed  version  of  a 
generaL  purpose  time  sharing  system.  The  assumption  is  that  it  is  not 
possiole  to  restrict  in  advance  the  modes  of  sharing  among  users.  It  is 
necessary  to  communicate  both  data  and  programs,  but  from  the  point  of 
view  of  the  mechanics  of  the  actual  exchange  of  information  this  type  of 
system  could  be  included  in  the  first  category. 

Die  distribution  can  take  place  along  two  main  lines,  based  on  functional 
separaoility  or  on  the  non-uniform  distribution  of  the  use  of  databases. 
Functionai  distribution  means  that  different  nodes  support  different  services. 
Such  systems  seem  natural  for  control  of  industrial  processes,  where  different 
nodes  control  different  parts  of  a process,  or  in  such  systems  as  aircraft, 
where  different  nodes  process  iiformation  from  different  sensors.  However, 
tnis  approach  seems  also  to  be  advantageous  in  service  sec  irs  such  as  banking 
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Database  distribution  characterizes  systems  where  an  individual  processor 


supports  the  same  services  but  on  a different  part  of  a database.  A typical 
example  is  a bank  with  many  branch  offices.  Each  branch  has  its  local 
accounts,  but  it  should  be  able  to  serve  a bank's  customer  whose  account  is  at 
another  branch.  Since  such  remote  requests  are  much  less  frequent  than 
manipulation  of  tne  local  accounts,  partitioning  of  the  bank's  accounts 
database  (that  is,  maintaining  accounts  on  a computer  at  their  local  branch) 
is  a natural  approach. 

It  must  be  said  that  the  division  between  functional  distribution  and 
database  distribution  is  not  clean;  in  most  cases,  a distributed  system  will 
to  some  extent  include  both.  The  latter  case,  however.  Implies  an  integrated 
database,  while  in  the  former  case  (functional  distribution)  the  databases 
used  by  individual  servers  are  much  more  independent.  The  functional 
distrloution  is  the  more  general  case.  In  either  case,  a distributed  database 
represents  a special  problem,  the  need  to  enforce  consistency  constraints  that 
span  several  nodes.  It  is  not  clear  how  often  this  problem  actually  arises, 
but  it  cannot  be  ignored. 

It  can  be  concluded  that  the  basic  paradigm  in  the  class  of  distributed 
systems  that  our  project  is  addressing  is  the  Invocation  of  remote  servers. 
This  can  be  viewed  as  a communication  protocol  of  much  higher  level  than,  for 
example,  the  host-to-host  communication  protocols  currently  employed  in  he 
ARPANET.  The  implementation  of  such  high  level  protocols,  however,  may  need 
to  differ,  depending  on  the  type  of  application,  and  possibly  on  the 
efficiency  and  reliability  requirements  of  the  application.  Therefore,  we 
should  not  aim  to  design  such  high  level  protocols,  but  Instead  develop  a set 
of  tools  that  facilitate  design  and  Implementation  of  such  protocols. 
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Finally,  an  application  study  by  d'Oliveira  [DOL177)  revealed  an 
important  result:  tnere  are  strong  pressures  toward  decentralization  for 
sociological  and  political  rather  than  technical  reasons.  We  infer  from  this 
study  that  decisions  about  the  distribution  of  information  among  the  various 
nodes  will  ne  made  for  external  reasons  that  only  the  application  itself  can 
specify.  Tnus,  the  application  builder  must  have  control  over  and  understand 
tne  placement  of  information. 


3.  THE  TARGET  OF  THE  PROJECT 

To  summarize,  we  view  a distributed  system  as  a collection  of  autonomous 
nodes  that  communicate  only  by  information  exchange  over  tne  communication 
network  that  connects  them.  In  such  a system,  at  least  two  levels  of 
coherence  must  be  enforced.  One  level  is  the  application  level  itself.  The 
second  level  is  the  sat  of  internode  communication  protocols  that  facilitate 
the  physical  exchange  of  information  (packets  of  bits).  But  there  is  a large 
gap  between  the  application  and  the  low  level  communication  protocols. 

Usually,  this  gap  results  in  a rattier  ad  hoc  implementation  of  the 
appl icat ion . 

Our  target  is  an  intermediate  level,  called  the  programming  system,  which 
will  support  a well-structured  design,  implementation,  maintenance  and  control 
of  distributed  applications.  This  level  is  more  than  a programming  language 
in  a traditional  sense.  Rattier,  this  level  is  envisioned  as  a set  of  tools 
that  include  primitives  found  in  conventional  higher  level  languages  such  as 
Pascal  or  PL/l,  out  also  primitives  normally  assumed  to  be  part  of  an 
operating  system,  for  example,  long-term  storage  and  cataloging  of  information 
or  control  of  protection  safeguards.  Thus,  this  programming  level  will 


integrate  tne  programming  language  and  Che  operating  system.  More  strongly. 


! 
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tnis  level  will  integrate  a programming  Language  and  a distr ibuted  operating 
system. 

rtie  design  goals  for  the  programming  system  include: 

Aim  for  as  n ign  a leve l as  poss ible , but  appl icat ion  independent . Our 
system  is  intended  to  be  used  to  implement  many  diverse  applications,  for 
example,  both  command  and  control  systems  and  administrative  systems  like 
inventory  control  systems.  To  adequately  support  such  a class  of 
appl icat ions,  the  language  should  be  as  high  level  as  possible  but 
general  purpose.  One  need  that  all  applications  share  is  the  ability  to 
exchange  potentially  quite  sophisticated  messages. 

Support  well-structured  programming.  Since  our  primary  motivation  is  to 
ease  the  task  of  the  application  programmer,  we  feeL  that  the  embedded 
language  should  borrow  from  existing  language  work,  in  particular 
building  on  Languages  such  as  CLU  [LISK77]  and  ALphard  [WUL.F76] , which 
aid  in  the  production  of  well-structured  programs  by  providing  powerful 
abstraction  mechanisms.  Of  particular  importance  is  the  data 
abstraction,  which  consists  of  a set  of  objects  together  with  a set  of 
operations  th.it  provide  the  only  means  for  manipulating  those  objects. 
Data  abstractions  tiave  been  investigated  so  far  mainly  in  the  context  of 
centralized  processing.  We  believe  that  they  will  be  even  more  useful  in 
the  type  of  distributed  systems  assumed  in  our  work,  because  they  provide 
a powerful  tool  in  organizing  a coherent  structure  where  the  data  of  the 
application  and  the  allowed  distributed  sharing  is  described  in 
application-oriented  terms.  Independent  of  the  idiosyncracies  of  the 
individual  autonomous  nodes. 
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Since  we  are  dealing  with  a distributed  environment  where  an 


operation  defined  on  the  application  level  may  require  the  assistance  of 
several  nodes,  the  language  must  support  concurrent  activities  (process 
abstractions).  Extensions  of  sequential  languages  will  be  necessary  to 
achieve  this.  To  enhance  ease  of  use,  we  will  keep  the  language  as 
conventional  and  conservative  as  is  consistent  with  our  other  goals. 
Support  communicat ion  in  terms  of  abstract  ob  jects.  Autonomous  program 
units  need  to  commune iate  in  terms  of  the  kinds  of  high  level  objects 
they  manipulate.  For  example,  the  ARPANET  supports  one  sort  of  "high 
level  object",  tne  ASCII  file,  but  any  other  form  of  data  must  be 
transmitted  as  a sequence  of  bits  and  explicitly  transformed  from  one 
representation  to  another  by  a user  written  program.  The  language  should 
support  communication  in  terms  of  abstract  objects,  regardless  of  the 
relative  location  of  the  sender  and  the  recipient  of  a message.  Two 
advantages  arise  from  tilts  approach.  First,  a clear  statement  can  be 
made  about  the  properties  of  data  that  the  units  depend  on.  Second,  it 
is  clear  how  to  accomplish  the  processing  that  is  needed  to  translate  an 
object  in  memory  into  1 message  transportable  by  the  communication 
network  and  vice  versa:  the  translation  is  accomplished  using  special 
operations  of  the  object's  type.  Note  that  this  nind  of  translation  is 
always  needed;  a language  that  requires  messages  to  be  composed  of  low 
level  objects  simply  obscures  this  fact. 

Allow  explicit  cont  rol  of  the  >ppl ir  at ion  d i st  r 1 but  ion ■ Conceptually, 
the  target  level  cun  be  viewed  as  an  ibstract  network  of  processes  where 
appl icat ion-def l ned  processes  communicate  via  messages  that  contain  high 
level  commands,  data  md  responses.  in  an  lde»l  situation,  this  is  all 


that  would  need  to  be  seen  by  the  application  programmer.  However, 


I 
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underneath  this  abstract  network  is  the  set  of  physical  nodes  and  the 
communication  lines  that  connect  them.  Our  study  of  applications  has 
indicated  that  the  mapping  of  the  objects  used  by  an  application  into  the 
physical  set  of  nodes  has  to  be  made  visible  to  the  application 
programmer.  We  are  assuming  that  objects  to  not  move  dynamically  from 
node  to  node,  depending  on  the  degree  of  demand  (sucn  dynamic  migration 
is  often  assumed  in  the  "distributed"  systems  consisting  of  many, 
relatively  tightly  coupled,  mini  or  microprocessors).  Rather,  when  a 
specific  node  is  cnosen  to  be  the  (new)  home  of  a particular  object,  an 
insta l lat ion  of  the  object  has  to  be  explicitly  requested  using  commands 
provided  by  the  programming  system.  This  assumption  is  based  on  the 
Relief,  discussed  earlier,  that  many  of  such  placement  decisions  will  be 
based  on  non-technical  factors  external  to  the  system  [DOLI77) . 

Support  snaring . The  programming  level  must  support  sharing  of 
information  represented  as  objects  that  reside  at  different  nodes  and 
Belong  to  different  users,  where  how  the  information  is  to  be  shared  is 
defined  by  tne  application.  An  important  aspect  of  sharing  is  to  provide 
controls  that  regulate  the  patterns  of  sharing  so  that  protection  and 
synchronization  constraints  are  properly  met.  It  is  also  necessary  to 
solve  problems  of  naming  across  nodes. 

Support  re  1 iab  le  ( robust)  operations.  Reliability  is  one  of  the  most 
important  goals  of  our  project.  A distributed  system,  by  its  very 
nature,  provides  a potent ial  for  enhanced  reliability.  However,  to 
exploit  this  potential,  the  system  and  the  application  have  to  be 
properly  designed.  An  arriving  message  must  be  tested  for  integrity  and 
authent ic ity,  using  a combination  of  automatic  system  features  and 
application  dependent  procedures,  and  there  must  be  control  over  timeouts 


14 


and  toe  number  of  retries  for  messages  sent  but  for  wnlch  a reply  bas  not 


oeen  received.  It  is  also  desirable  to  have  a means  for  specifying  that 
in  online  backup  copy  is  requested  for  an  object. 

Support  cli. nixing  patterns  of  use.  We  cannot  expect  an  application  to  be 
written  once  and  never  modified.  First,  tne  system  will  grow  by  the 
idditlon  ot  new  nodes.  Second,  new  patterns  of  use  will  arise  involving 
existing  or  new  pieces  of  information.  Thus,  we  can  expect 
syncnroiilz.ic  Ion  and  protection  constraints  to  change  with  time.  This 
•bulge  nust  not  cause  upheaval  in  the  design  of  existing  parts  of  the 
ipp l lent  ion . 

We  want  to  e.npnasize  tint  the  envisioned  programming  system  is  not 
intended  for  the  end  user,  but  for  the  application  builder  (programmer), 
although  in  some  environments  (such  as  LCS)  there  is  often  little  distinction 
between  the  two  classes  of  users.  Also,  it  should  not  be  necessary  for  all 
nodes  in  the  distributed  system  to  support  the  full  language;  each  node  need 
•nlv  support  the  ippropriite  (hign  level)  internode  communication  protocol. 

4.  COHERENCE  VS  AUTONOMY 

Since  lutonomy  is  such  > basic  property  in  our  model  of  distributed 
systems,  i natural  question  to  ask  is  how  much  the  coherence  that  we  strive 
for  in  our  project  constrains  the  autonomy  of  individual  nodes.  The  last 
section  indicated  that  at  least  three  levels  of  coherence  are  required: 

1)  application  level 

2)  programming  level  for  distributed  applications 
l)  inter-node  communication  protocols 
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The  need  far  coherence  on  leveLs  l and  3 Ls  unquestionable.  However,  level  2 
may  not  be  necessary,  especially  if  the  system  Is  designed  for  a single 
specific  application.  Level  2 nas  been  introduced  as  a result  of  a desire  to 
provide  an  effective  programming  environment  for  implementation  of  a variety 
of  distributed  appl icat ions . Thus,  tne  above  question  can  be  restated  as: 
does  tnis  intermediate  level  impose  any  additional  constraints  on  the  autonomy 
of  individual  nodes?  We  believe  that  it  does  not  add  new  constraints,  but 
merely  divides  the  constraints  into  two  categories:  tne  constraints  that  have 
to  oe  observed  by  the  application  programmer,  and  the  constraints  that  the 
application  programmer  no  longer  has  to  be  aware  of,  since  the  programming 
system  handles  them  for  him. 

Basically,  autonomy  has  two  different  aspects: 

Operational  Autonomy . A node  can  operate  (at  least  to  some  degree)  even 
while  it  is  completely  cut  off  the  rest  of  the  system. 

Administrat ive  Autonomy . The  owner  of  a node  can  exercise  a certain 
degree  of  control  over  the  use  of  the  node  even  while  the  node 
pirticipates  in  a distributed  application. 

Both  of  these  aspects  have  to  be  considered  in  the  design  of  a distributed 
system;  both  interfere  with  the  goal  of  coherence. 

A good  discussion  of  the  mechanisms  needed  to  solve  the  problem  of 
operational  autonomy  can  be  found  in  [MONT78].  Administrative  autonomy  is  a 
broader  and  less  understood  issue.  On  the  application  level,  administrative 
autonomy  is  associated  with  an  individual  or  a group  of  people,  users  or 
project  managers.  Potent  ia L ly , more  than  one  autonomous  project  or  service 
could  be  supported  on  the  same  node,  or  several  nodes  could  form  such  an 
autonomous  unit.  The  coherence  required  to  join  such  autonomous  projects  or 
services  is  determined  strictly  by  the  application;  there  are  also  many 
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situations  where,  on  the  application  level,  local  autonomy  is  completely 
supressed.  Tnus,  the  application  limits  the  autonomy  affordable  in  the 
individual  nodes.  The  notion  of  administrative  autonomy  needs  to  be  defined 
more  formally;  at  the  present  time,  only  some  practical  examples  of  the 
possible  manifestations  of  autonomy  are  given: 

So  lec t ion  of  the  hardware  components  and  their  configurations.  Even  if 
tiie  overall  system  has  to  be  homogeneous  at  the  machine  level,  it  is 
still  necessary  to  select  for  each  node  a particular  processor  the  amount 
'«  of  main  memory  and  secondary  storage,  etc.  Such  decisions  can  be 

entirely  autonomous;  they  weill  depend  mostly  on  the  type  and  the  amount 
of  worx  to  be  done  at  tne  particular  node,  rattier  than  being  dictated  by 
the  ooligation  to  support  a common  high-level  language. 
hoc a 1 Software . In  addition  to  supporting  the  common  programming 
language  and  tiie  inter-node  communication  protocols,  additional  software 
for  local  needs  can  be  developed,  maintained,  and  run  on  individual 
nodes.  Also,  the  implementation  of  the  common  programming  language  and 
the  communication  protocols  can  differ  from  one  node  to  another,  as  long 
! . as  the  implementation  conforms  to  the  specifications.  It  also  may  be 

possible  that  some  nodes  need  not  support  more  than  only  a small  subset 
of  the  common  language  in  order  to  be  able  to  handle  their  parts  of  the 
distributed  application. 

Naming . Each  node  can  have  its  own  independent  way  of  naming  its  local 
objects.  Lower  level  names  may  be  completely  inaccessible  from  outside 
the  node.  An  application,  however,  may  enforce  a uniform  one-level 
naming  space.  Alternatively,  a structured  name  space  may  be  used,  where 
| each  name  consists  of  the  name  of  the  node  containing  the  object  and  the 

local  name  for  that  object.  If  the  fact  that  the  system  is  distributed 
J 
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must  be  Invisible  to  tbe  end  user,  the  name  space  seen  by  the  user  must 
be  one- level.  Somehow,  however,  this  type  of  name  must  eventually  be 
translated  into  a structured  name  space  to  locate  the  desired  object. 

Thus  the  extent  to  which  the  autonomy  in  assigning  names  is  constrained 
depends  on  tne  application. 

Av a i lab i 1 ity  of  local  resources . Each  node  can  control  access  to  its 
private  objects  (not  part  of  the  common  application).  Local  control  of 
objects  that  are  part  of  the  common  application  may  be  restricted  by  the 
application.  Similarly,  the  need  to  use  local  hardware  resources  imposes 
constraints  on  the  individual  nodes;  a node  may  not  be  free  to  refuse  a 
request  to  perform  some  service  in  order  to  do  some  local  work  or 
disconnect  for  maintenance.  While  these  constraints,  due  to  the 
decentralized  control  of  individual  nodes,  cannot  be  enforced  physically 
at  the  moment  of  crisis,  they  can  be  enforced  outside  of  the  system,  e.g. 
by  legal  means. . 


5.  MODEL  OF  DISTRIBUTED  COMPUTATION 

Tnis  section  presents  a model  of  the  universe  of  entities  that  take  part 
in  a distributed  computation.  We  assume  that  each  entity  has  an  identity  that 
is  permanent;  an  entity  can  be  referred  to  by  giving  its  name . We  are  not 
concerned  here  with  all  aspects  of  the  behavior  of  entities,  but  rather  limit 
our  attention  to  questions  concerning  the  locations  of  entities  within  the 
network  and  how  entities  can  refer  to  other  entities. 
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5.1  Types  Of  Entities 

The  basic  model  of  distributed  computing  provides  two  different  kinds  of 
entities:  processes  and  everything  else.  A process  is  active,  and  is  thought 

of  as  being  the  execution  of  a sequential  program.  Non-process  entities, 
wliicn  we  will  call  ob  jects,  are  passive,  i.e.,  they  do  not  originate  any 
activity.  Examples  of  objects  are  integers,  arrays,  stacks,  procedures,  etc. 
Oojects  have  a state  (value)  tnat  may  change.  If  the  state  can  change  during 
the  object's  lifetime,  then  tne  object  is  mutable. 

A process  can  communicate  with  another  process  by  sending  it  a message. 

We  assume  that  the  syntax  and  semantics  of  message  passing  is  independent  of 
tne  nodes  of  residence  of  tne  two  communicating  processes  (although  certain 
optimizations  can  be  performed  by  the  system  if  both  processes  reside  at  the 
same  node).  A process  can  use  an  object  by  performing  (invoking)  an  operation 
on  it  (or  by  invoking  it  if  it  is  a procedure);  again,  the  semantics  of 
invocation  is  the  same  regardless  of  the  nodes  of  residence. 

Tne  decision  about  whether  the  entities  used  in  the  model  of  computation 
are  all  uniform  or  whetner  the  model  distinguishes  different  classes  of 
entities  is  a fundamental  one.  Basically,  the  uniformity  concerns  the  ways  in 
wnich  entities  may  be  used  (and  may  use  other  entities).  In  our  model,  two 
basic  primitive  operations  are  used:  invocation  and  message  passing.  We 
intend  that  the  semantics  of  invocation  be  distinct  from  message  passing;  the 
following  sections  will  clarify  the  reasons  for  this  distinction. 

An  example  of  a computational  model  in  which  all  entities  are  uniform  is 
the  Actor  System  [HEWI7b|.  In  this  system,  every  entity  is  an  actor,  and  an 
actor  is  used  by  sending  it  a message  that  is  also  an  actor.  There  is  only 
one  basic  primitive,  message  passing,  so  our  model  seems  more  complicated. 
However,  we  believe  that  it  is  more  natural  than  the  actor  model  and  will 
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therefore  be  easier  for  programmers  to  understand.  If  programs  built  out  of 


actors  are  examined,  it  is  clear  that  there  are  "data-like"  actors, 

"procedure- L ike"  actors  and  "process-like"  actors.  We  b ieve  these 
differences  are  fundamental  and  should  be  reflected  in  the  language  in  its 
semant ics . 

5.2  Location  Of  Entities 

rhe  universe  of  entities  is  spread  across  the  physical  nodes  that  make  up 
the  network.  One  important  question  concerns  the  location  of  entities:  is  an 
entity  permanently  located  at  a particular  node,  or  can  it  move  from  node  to 
j node? 

To  make  a decision  here,  we  must  consider  several  issues: 
i.  Earlier  we  discussed  our  conclusion,  based  on  an  analysis  of 
appl icat Ions,  that  the  application  programmers  must  be  able  to 
control  the  Location  of  entities.  Note  that,  at  the  Least,  this 
conclusion  precludes  automatic  relocation  of  entities  by  the  system, 
although  relocation  under  program  control  wouLd  stilL  be  possible. 

, ii.  We  are  assuming  that  nodes  are  autonomous  and  possibly 

heterogeneous.  Even  under  program  control  it  is  possible  to  move  an 
entity  to  an  autonomous  node  only  if  that  node  is  willing  to  accept 
it.  Furthermore,  if  that  node  is  different  from  the  current  home 
node  of  the  entity,  considerable  translation  may  be  needed  to 
effectively  move  the  entity. 

Inerefore,  we  beLieve  that  entities  shouLd  have  a permanent  iocat ion  at 
some  node  in  tile  network.  An  entity  comes  into  existence  at  some  node  (when 
it  is  created)  and  remains  at  th.it  node  until  it  is  destroyed.  Moving  an 

% 

entity  can  be  accompLisned  by  having  a program  create  a new  entity  and  Letting 
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it  " take  over"  from  the  oLd  one;  However,  the  relationship  between  the  two 
entities  is  not  recognized  by  the  system,  and  represents  a higher  level 
concept  of  identity  than  that  introduced  above. 

One  consequence  of  this  decision  is  that  it  will  be  easy  for  the  system 
to  create  unique  names  for  entities  and  to  interpret  entity  names,  since  the 
node  of  residence  can  be  part  of  the  name. 

5.3  Restrictions  On  Referring  To  Entities 

An  entity  may  refer  to  another  entity  by  using  or  containing  its  name. 

For  example,  a process  will  have  local  variables  that  may  contain  the  names  of 
ottier  entities  (both  processes  and  objects);  as  the  process  executes,  it  can 
use  tnese  names.  A data  object  is  represented  by  some  storage  (at  its  node  of 
residence),  and  some  of  this  storage  may  contain  names  of  entities  (again  both 
processes  and  objects). 

In  our  model,  the  universe  of  objects  is  divided  into  mutually  exclusive 
sets.  A process  is  associated  permanently  (tnat  is,  for  its  lifetime)  with  a 
single  specific  set.  Tne  objects  in  this  set  are  private  to  the  process;  the 
set  forms  tne  local  address  space  of  tne  process. 

w/e  nave  chosen  the  following  restrictions  on  how  the  entities  of  the 
model  can  refer  to  each  other: 

i.  a process  can  refer  directly  only  to  its  private  objects 
ii.  an  object  can  refer  only  to  the  objects  that  belong  to  the  same 
local  address  space. 

There  are  no  restrictions  on  referring  to  processes:  both  processes  and 
objects  can  name  other  processes. 

Tne  above  restriction  can  be  enforced  as  follows:  messages  can  contain 
tne  names  of  processes  out  not  the  n. lines  of  objects.  A model  obeying  this 


restriction  is  illustrated  in  Figure  l.  The  nodes  labeled  Pi  are  processes, 
wtiile  nodes  labeled  Oi  ire  objects.  Two  kinds  of  directed  arcs  are  shown.  A 
solid  arc  from  entity  x to  entity  y means  y is  a process  and  x names  y,  while 
a dashed  arc  means  , is  in  object  and  x names  y.  A process  may  (ultimately) 

refer  to  an  object  in  the  course  of  its  execution  if  there  is  a path  from  the 

process  to  the  object  consisting  entirely  of  dashed  arcs.  The  set  of  all  such 
objects  form  tne  local  address  space  of  the  process. 

Our  furttier  restriction  is  that  all  objects  in  the  same  local  address 
space  must  reside  on  the  same  physical  node.  Alternatively,  a process  could 
be  allowed  to  perform  an  operation  on  an  object  whether  that  object  resided  at 
the  same  node  or  not.  Invocation  of  an  operation  on  a remote  object  can  be 

made  to  work,*  but  has  the  disadvantage  that  what  appears  to  be  a simple 

invocation  will  involve  internode  communication,  and  therefore,  can  take  a 
long  time,  and  may  even  have  slightly  different  semantics.  in  particular,  a 
local  invocation  has  two  possible  outcomes: 

i.  the  operation  completed  successfully 
ii.  some  error  occurred. 

For  an  operation  on  a remote  object,  however,  another  kind  of  outcome  is 
possible,  in  particular,  if  no  reply  is  received  from  the  remote  node,  it  is 
simply  not  known  what  has  actually  happened.  The  message  to  or  from  the 
remote  node  could  have  been  lost,  or  the  node  could  have  failed  before,  after, 
or  in  the  middle  of  processing  the  request. 


* The  invocation  must  take  place  at  the  object's  node,  since  as  discussed  in 
the  proceeding  section  the  object  cannot  move  to  the  invoker's  node. 
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5.4  Guardians 


The  model  that  we  have  developed  thus  far  represents  an  abstract  network. 
Processes  are  the  nodes  of  this  abstract  network;  each  process,  analogously  to 
a physical  node,  has  its  private  memory  and  can  communicate  with  other 
processes  only  by  sending  messages. 

Tne  abstract  network  model  has  several  advantages: 

i.  The  programmer  organizes  the  locations  of  entities  by  considering 
where  to  locate  the  abstract  nodes,  (e.g.  each  process  with  its 
local  memory).  This  seems  easier  than  worrying  about  each  entity 
ind  ividually. 

ii.  Operations  are  always  invoked  locally.  This  is  simpler  to  implement 
than  remote  invocation,  and  also  avoids  some  arbitrary  time  delays.* 

iii.  Management  of  storage  for  objects  (e.g.  garbage  collection)  can  be 
done  locally  on  each  node. 

Although  two  processes  cannot  refer  to  the  same  object,  they  can  share  an 
object  if  they  both  name  the  process  that  can  refer  to  the  object.  Such  a 
process  will  be  called  a guardian.  A guardian  may  guard  one  or  several 
objects;  its  job  is  to  synchronize  possibly  concurrent  requests  to  perform 
operations  on  the  guarded  objects.  In  Figure  1,  P2  is  a guardian  for  06  and 
07,  which  are  shared  by  PI  and  P3. 

The  abstract  network  model  requires  two  extensions  to  be  useful.  First, 
tne  requirement  that  local  address  spaces  of  processes  are  disjoint  may  need 
to  be  relaxed.  Tnere  are  two  reasons  why  it  might  be  desirable  to  have 
several  processes  in  a guardian.  First,  the  processes  could  provide 


* Of  course,  the  operation  activation  might  send  a message,  e.g.  to  some 
process  whose  name  was  contained  in  one  of  the  input  objects  to  this 
operat ion. 
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additional  concurrency,  which  could  be  used  to  improve  response.  Second,  a 
process  could  provide  continuity.  If  a user  and  the  guardian  need  to  have  a 
conversation,  providing  a process  for  the  user  to  interact  with  would  be  a 
natural  means  of  realization.  The  processes  within  a guardian  would  share 
objects  directly.  A guardian  would  be  defined  using  a special  syntactic 
construct,  sometnlng  like  a serializer  [HEW1771,  that  specifies  the  processes 
making  up  the  guardian  and  their  intercommunication;  all  the  processes  in  the 
guardian  would  reside  at  the  same  node. 

Second,  in  the  case  of  a guardian  th3t  guards  several  objects,  some 
efficient  mechanism  is  needed  that  permits  a user  process  to  specify  to  the 
guardian  the  particular  object  of  interest,*  and  for  the  guardian  to  determine 
that  the  object  so  specified  is  one  it  guards.  Often  no  special  mechanism  is 
needed  for  this.  For  example,  a query  sent  to  a data  base  guardian  names  a 
record  (or  records)  in  the  data  base  by  means  of  a high  level  name.  However, 
it  appears  that  sometimes  efficient,  low  level  names  are  needed.  Consider  a 
guardian  of  a disk  that  provides  logical  tracks.  When  a user  requests  some 
data  to  be  stored  on  the  disk,  the  manager  returns  the  logical  track  number 
where  the  data  is  stored.  Later,  the  data  can  be  read  by  sending  the  logical 
track  number  back  to  the  guardian.  However,  the  logical  track  number  must  be 
one  created  by  the  guardian,  and  also  must  still  exist  at  the  time  of  the  read 
request . 

The  guardian  construct  can  model  different  degrees  of  autonomy.  For 
example,  if  a process  requests  an  operation  on  data  that  are  available  only 
through  the  guardian,  such  a request  may  fail  since  the  guardian  may  refuse  to 


* This  should  not  be  interpreted  as  a requirement  that  the  system  ought  to 
guarantee  that  an  object  continues  to  exist  as  long  as  some  user  can  specify 
it. 


release  requested  data,  or  in  some  cases  may  even  destroy  the  data  at  Its  own 
discretion.  Also,  a guardian  does  not  have  to  know  a priori  about  all 
processes  that  may  request  operations  on  the  guarded  objects.  That  is,  a 
guardian  can  be  a general  server  that  accepts  requests  from  any  process  or  a 
class  of  processes. 

5.5  Summary 

In  their  recent  paper,  Lauer  and  Needham  argue  that  message  passing  and 
procedure  calling  (invocation)  are  essentially  equivalent,  and  consequently 
either  can  be  used  as  the  basic  and  only  primitive  in  the  implementation  of  an 
operating  system  [LAUE781.  As  follows  from  our  model,  we  believe  that  it  is 
advantageous  to  have  both.  One  reason  is  the  semantic  difference  between 
local  and  remote  invocations,  as  discusses  earlier.  However,  the  whole 
concept  of  the  abstract  network  is  based  on  the  distinction  between  and  the 
comDination  of  message  passing  and  invocation.  The  abstract  network  makes  it 
explicit  wtien  an  operation  is  to  be  performed  on  an  object  that  belongs  to  a 
different  set,  where  the  difference  may  lie  in  the  logical  function  of  the 
set,  protection  constraints,  or  administrative  responsibilities.  For  a 
distriDuted  system  that  supports  several  different  applications  on  highly 
autonomous  nodes,  the  abstract  network  is  an  appropriate  model. 


6.  KELtABam  ISSUES 

In  tne  future,  reliability  will  be  one  of  the  major  issues  in  information 
processing  sytems.  This  claim  is  based  on  two  observations.  First,  the 
quantity  of  information  entrusted  to  a computer  system  is  ever  increasing. 
Second,  tne  complexity  of  the  operations  performed  by  a computer  is  also 
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increasing.  More  and  more  organizations  and  systems  are  dependent  on  computer 
maintained  information  and  a failure  of  these  computer  systems  can  often  be 
critical.  Thus  nigh  reliability  is  not  just  a requirement  for  real-time 
systems  controlling  space  vehicles  or  industrial  processes  failure  of  which 
would  endanger  human  lives. 

Reliability  of  an  information  processing  system  is  not  merely  a question 
of  software  correctness.  Hardware  failures,  synchronization  failures,  and 
errors  of  the  human  users  must  oe  anticipated  and  handled  gracefuLly.  The 
only  way  to  design  a reliable  system  is  to  make  it  "fault-tolerant",  or, 
robust  in  face  of  a large  variety  of  internal  failures  and  misuse. 

Distriouted  systems  are  often  claimed  to  be  inherently  more  reliable  than 
systems  based  on  a large  central  processor.  That  is,  given  that  a distributed 
system  is  properly  designed,  it  offers  better  reliability.  This  claim  is 
based  on  several  factors.  First,  distributed  systems  by  their  very  nature 
provide  opportunities  for  redundancy.  Second,  error  propagation  is  restricted 
by  physical  separation  of  processes  and  resources.  And  finally,  individual 
nodes  in  tne  distributed  system  may  be  less  complex  than  a large  central 
processor  and,  as  a result,  ought  to  have  lower  probability  of  failures. 
Basical Ly,  distributed  systems  have  a potential  for  being  more  reliable  than 
systems  based  on  a large  central  processor.  However,  this  potential  needs  to 
be  exploited  through  proper  design. 

For  the  purpose  of  the  discussion  of  reliability  issues,  the 
implementation  of  the  abstract  network  introduced  in  the  preceeding  section  is 
divided  into  two  levels:  the  appllcaton  level  and  the  system  level.  The 
system  level  is  all  the  mechanisms  needed  to  support  the  view  presented  to  the 
application  programmer  (that  is,  the  hardware  and  software  run-time  support  of 
tne  programming  system).  The  level  built  on  the  top  of  this  level  using  the 
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tools  available  to  the  application  programmer  is  referred  to  as  the 
app l icat ion  level . 

Reliability  mechanisms  are  those  mechanisms  that  assist  in  detection, 
reporting  and  recovery  from  errors  and  failures.  An  error  is  an  internal 
state  of  an  entity  that,  if  special  steps  (recovery)  are  not  taken,  will 
result  in  a failure  of  the  entity  (or,  in  the  case  of  data  objects,  failure  of 
in  operation  on  the  entity).  Some  errors  can  be  handled  entirely  by  the 
entity  itself,  and  thus  remain  invisible  to  other  entities  (e.g.  users  of  that 
entity);  such  errors  are  said  to  be  masked . Detected  errors  that  cannot  be 
Handled  internally  should  be  reported  to  the  users  by  signalling  a failure. 
Undetected  errors  also  turn  into  failures;  it  is  possible  that  a user  can 
detect  tnis  kind  of  failure,  but  the  problem  is  much  more  complex  than  with 
the  reported  failures. 

To  achieve  reliable  operations  from  the  application  point  of  view,  both 
the  system  level  and  the  application  level  have  to  include  mechanisms  for 
detection  and  handling  of  errors  and  failures.  For  each  type  of  error,  it  is 
necessary  to  decide  where  it  can  be  detected  and  how  it  should  be  handled. 

Some  classes  of  errors,  detected  within  the  system  level,  can  be  masked,  but 
for  others  a failure  lias  to  be  reported  to  the  application  level.  Other 
errors  are  application  dependent  and  therefore,  their  detection  and  handling 
must  be  done  at  the  application  level.  Basically,  in  the  case  of  system 
errors,  there  is  a gray  area  where  a decision  has  to  be  made  as  to  whether 
these  errors  will  be  masked  by  the  system  level  or  reported  as  failures  to  the 
application  level.  It  may  also  be  possible  that  an  attempt  to  mask  an  error 
falls;  it  is  necessary  to  decide  if  and  how  many  times  the  system  should 
repeat  the  attempt  before  reporting  a failure.  The  important  factors  in  these 
decisions  are  the  cost  and  complexity  of  the  masking  mechanisms  on  one  hand 
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and  the  convenience  to  the  programmer  on  the  other  hand.  Both  sides  have  to 
be  carefulLy  evaluated. 

Tne  system  must  provide  a means  for  detecting  and  correcting  or  reporting 
errors  arising  from  the  operation  of  the  hardware  and  the  software  that 
supports  the  application  programs.  However,  the  system  aLso  has  to  provide 
suitable  primitives  for  the  application  programmer  to  facilitate  handling  of 
the  application  specific  errors  and  communication  of  the  system  detected 
errors  to  tne  application  programs. 

6.1  Availability  And  Correctness 

Reliability  has  two  aspects  that,  unfortunately,  cannot  always  be 
separated;  in  particular,  their  solutions  may  conflict.  One  aspect  is  the 
availability  of  tne  entities  needed  to  perform  a specific  task.  The  other  is 
the  correctness  of  tne  available  entities;  a very  important  special  case  is 
the  integrity  of  the  stored  information. 

To  assure  correct  operation,  the  system  and  the  application  must  be 
prepared  to  handle  errors  that  originate  in  lower  levels,  in  particular, 
hardware  faults  and  possible  residual  bugs  in  the  software  that  comprises  both 
the  system  level  and  the  application  level.  It  is  also  necessary  to  be 
prepared  to  deal  with  errors  whose  source  is  the  user  of  an  entity.  Since  the 
user  may  be  a process  running  on  another  node,  these  latter  errors  may  be 
caused  by  hardware  or  software  failures  in  the  user's  node,  or  they  may  occur 
because  the  requesting  process  either  doesn't  know  how  to  use  the  requested 
entity  properly  or  is  trying  to  misuse  it  Intentionally.  Thus,  to  ensure 
correct  operation  of  an  entity,  it  is  necessary  to  ensure  both  that  the 
operations  on  that  entity  are  performed  correctly  in  9plte  of  possible 
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failures  of  the  node  on  which  the  entity  resides,  and  also,  it  is  necessary  to 
defend  the  entity  from  possible  misuse  by  other  processes. 

As  described  earlier,  processes  in  different  guardians  can  communicate 
only  by  sending  messages.  Objects  can  be  manipulated  only  within  their 
guardian.  To  protect  an  object  from  misuse,  it  is  necessary  to  ensure  that 
access  is  indeed  limited  to  the  guardian  and  that  all  incoming  messages  are 
carefully  scrutinized  to  determine  whether  the  request  is  reasonable  and  the 
effects  of  performing  requested  operations.  Within  the  guardian,  it  is 
necessary  to  provide  mechanisms  that  will  protect  the  resource  from  being 
damaged  or  lost  (made  inaccessible)  due  to  errors  arising  from  the  faults  in 
the  structures  that  implement  the  guardian. 

Availability  is  constrained  by  two  factors: 
i.  the  efficiency  of  tne  system,  that  is,  the  actual  physical  delay  and 
queueing  time  in  the  abstract  network,  and 
ii.  failures  in  the  abstract  network. 

Availability  has  several  connotations.  Firstly,  it  is  used  to  indicate 
whether  an  entity  is  useable,  that  is,  if  the  respective  process  will  execute 
the  operation  requested  once  the  request  is  brought  to  its  attention  (e.g. 
gets  to  the  head  of  the  queue).  Secondly,  it  is  used  to  indicate  whether  an 
entity  can  be  used  immediately,  or  whether  there  is  a contention  for  the 
entity.  ThirdLy,  and  this  aspect  plays  an  important  role  in  a distributed 
system,  it  may  be  used  to  indicate  whether  an  entity  is  accessible.  An  entity 
may  be  useable  and  unused,  but  the  path  to  it  may  be  broken. 

it  is  possinle  to  translate  all  three  aspects  into  the  problem  of  how 
long  it  is  necessjry  to  wait  for  a resource.  A useable  entity  may  not  be 
immediately  available  due  to  contention  for  the  entity,  but  also  due  to  long 
communication  delays;  if  the  common icat ion  path  is  broken,  the  communication 
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del  lys  may  be  unusuaLLy  Long,  even  infinite.  Similarly,  if  an  entity  is 
unuseanle,  the  wait  time  for  the  entity  to  become  useable  may  be  very  long, 
possiQly  infinite.  Since  in  a distributed  system  it  is  not  always  possible  to 
determine  the  cause  of  a Long  delay,  the  system  may  have  to  respond  to  poor 
performance  (due  to  overload)  in  the  same  way  it  responds  to  functional 
failures  of  the  resources  and  communication  paths.  Thus,  in  a way,  poor 
performance  (due  to  overload)  or  turns  into  a failure! 

From  all  three  of  these  aspects,  availability  can  be  enhanced  if  several 
instances  (copies)  of  an  entity  are  maintained  at  different  physical  nodes: 
i.  Coping  witn  fa i lures . If  3 node  fails,  or  commun icat ion  with  a 

particular  node  fails,  it  should  be  possible  for  processes  at  other 
nodes  to  continue.  That  means  that  entities  provided  by  the  failed 
(or  inaccessible)  node  have  to  be  provided  by  some  other  node(s)  in 
tile  remaining  operational  network,  (each  operational  partition  of  the 
network) . 

ii.  Coping  with  bo 1 1 lenecks . Even  if  the  nodes  and  the  communication 

network  of  a distributed  system  never  fail,  a single  instance  of  an 
entity  may  not  provide  sufficient  availability.  An  entity  may 
become  a bottleneck;  also,  the  communication  delays,  especially  in  a 
Long-haul  network,  may  be  substantial,  and  it  thus  may  be  desirable 
to  have  a Local  instance  of  the  entity  (and,  consequently,  support 
multiple  copies). 

In  the  systems  under  consideration,  the  most  important  type  of  object  is 
a data  object.  Maintaining  muLtiple  copies  of  data  objects  that  need  to  be 
frequently  updated  represents  a special  problem.  It  is  important  to 
distinguish  between  failures  and  bottlenecks  since  the  right  solutions  to  the 
probLem  of  mutual  consistency  are  significantly  different.  In  the  first  case, 
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only  one  copy  needs  to  be  actively  used,  that  is,  an  object  has  a master  copy 


and  one  or  more  backup  copies.  The  changes  made  to  the  master  copy  must  be 
propagated  to  the  backup  copies,  immediately  if  every  state  of  the  object  must 
be  recoverable,  or  periodically  upon  special  command  if  in  a case  of  a failure 
it  is  sufficient  to  back  out  ot  some  consistent  state,  not  necessarily  ttie 
last  consistent  state.  In  the  second  case,  all  copies  must  be  available  for 

active  use.  It  is  often  assumed  that  all  copies  must  always  be  the  same,  but 

this  requirement  may  defeat  the  very  purpose  for  which  the  multiple  copies 
were  introduced:  reduction  of  delays.  The  delay  caused  by  synchron izat ion  of 
updates  with  other  updates  and  accesses  of  multiple  copies  may  exceed  the 
delay  that  would  result  if  only  one  copy  were  maintained.  However,  it  is  not 
always  necessary  to  nave  the  most  current  version  of  an  object;  the 
information  obtainable  from  an  older  version  may  be  entirely  satisfactory. 

Inus  it  seems  much  more  realistic  to  allow  for  multiple  versions  of  an  object; 
the  local  copy  may  not  always  be  the  most  current  version,  but  the  most 
current  version  is  known  and  a local  copy  of  it  can  be  obtained  upon  request. 

Tne  system  level  ought  to  support,  in  a selective  way,  the  kind  of 

redundancy  required  to  cope  witn  failures.  The  other  case  is  more  complex  and 
more  application  dependent.  The  fact  that  there  exist  several  versions  of  an 
object  may  need  to  be  visible  not  just  to  the  application  programmer,  but  to 
the  application  user.  Thus,  the  solution  should  be  left  to  the  application 
level;  the  system,  however,  ought  to  provide  mechanisms  to  make  the  solution 
possiole.  A mechanism  for  maintaining  multiple  versions  of  objects  in  such  a 
way  that  a consistent  version  of  a set  of  objects  can  always  be  obtained  was 
developed  by  Reed  [REED78] . In  addition,  Reed's  mechanism  solves  the  problem 
of  updates  and  backout  in  a distributed  system  in  a most  natural  way. 
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However,  to  prevent  loss  of  information,  the  most  current  version  ought  to 
have  at  least  one  backup  copy.  The  scheme  that  combines  multiple  versions  and 
backup  copies  is  sketched  in  Figure  2. 

6.2  Reliable  Communication  Subsystem 

The  communication  suosystem  is  the  part  of  the  system  level  that  delivers 
messages  between  physical  nodes.  This  subsystem  consists  of  the  actual 
pnysical  network  of  communication  lines  and  of  communication  processes  that 
control  tne  delivery  of  nessages.  Figure  3 shows  relationship  between  the 
abstract  network  and  the  commun icat ion  subsystem.  The  application  processes 
exchange  messages  that,  logically,  contain  values  of  high  level  (abstract) 
objects  meaningful  at  that  level.  The  values  of  these  objects  have  to  be 
translated  (encoded)  into  a string  of  bits  for  delivery  to  another  node  and 
decoded  into  the  proper  abstract  objects  at  the  receiving  node.  At  the  system 
level,  messages,  now  in  tne  form  of  bits,  may  have  to  be  partitioned  into 
packets.  Tne  messages  are  checksummed,  so  that  transmission  errors  can  be 
detected.  It  is  difficult  to  correct  transmission  errors  at  the  receiving 
node,  since  transmission  errors  are  bursty  (affect  not  just  a single  bit,  but 
several  bits).  Cnecksum  facilitates  detection  of  errors,  where  the  number  of 
detectable  simultaneous  errors  is  determined  by  the  size  of  the  checksum 
field.  Correction  is  performed  through  retransmission.  In  general,  once  a 
message  has  oeen  translated  into  a string  of  bits,  the  protocols  used  by  the 
communication  processes  should  take  care  of  the  correct  transmission,  that  is, 
either  a correct  message  is  delivered,  or  nothing  is  delivered.  However,  the 
primary  responsibility  for  checking  that  a message  has  been  acted  on,  that  is, 
ensuring  that  a process  that  sent  a message  will  not  wait  indefinitely,  and 
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0,,  master  copy  of  the  object  (current  version) 

0g  backup  copy  of  the  object  (current  version) 

0 object  version  i (current  or  older  version) 

U,  users 

k 

> update  requests 

■>  propagation  of  changes 

read  requests 

f Figure  2 : Multiple  version  scheme  with  backup. 
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Ni  nodes 

<  > possible  communications  on  the  application  level 

<  £»  flow  of  information  in  the  network 

Figure  3:  The  abstract  network:  communications  on  and  between  the  application 

and  the  communication  subsystem. 


also  that  the  message  contains  values  acceptable  from  the  application 
standpoint,*  must  rest  witn  the  application. 


One  of  the  most  difficult  problems  In  this  type  of  distributed  system  Is 
that  unless  ait  explicit  reply  (or  an  acknowledgement ) is  received,  it  is 
impossible  to  determine  with  certainty  whether  a message  sent  to  a process  at 
a different  physicaL  node  has  been  received  and  if  the  receiving  process  has 
acted  on  it.  The  only  defense  tgainst  possibly  waiting  indefinitely  for  a 
response  is  to  use  a timeout  mechanism.  The  sender  of  a message  can  specify  a 
Lime  Interval  ifter  which  iL  gives  up  waiting  for  the  response;  the  timeout 
nechantsm  will  alert  the  sender  when  such  a time  interval  has  elapsed.  The 
possible  reactions  of  the  sender  to  a timeout  event  can  be  divided  into  two 
categories:  the  sender  decides  to  give  up  the  attempt  to  communicate  with  the 
particul  ir  process,  or,  the  sender  decides  to  resubmit  the  request.  Because 
of  the  uncertainty  discussed  above,  it  is  possible  that  the  first  request  will 
eventually  be  processed.  Tims,  In  the  first  case,  the  request  may  be 
processed  in  spite  of  the  sender's  decision  not  to  continue  and  may  conflict 
with  the  subsequent  lotions  taken  by  the  sender  after  the  timeout.  In  the 
second  cise,  the  same  request  may  be  processed  twice,  possibly  leading  again 
to  an  inconsistency.  Thus,  in  situations  where  an  inconsistency  may  arise 
form  such  internode  requests,  it  is  necessary  to  use  special  (often  complex) 
protocols  [L.AMl’76,  GRAY  78 , RF.ED7S,  MONT78,  TAKA78).  The  questLon  that  arises 
at  this  point  is  if  such  protocols  ought  to  be  a part  of  the  communication 
subsystem.  We  believe  that  many  such  situations  represent  special  cases  that 


* A message  may  contain  a higher-level  error:  either  a message  has  noL  been 
constructed  properly  by  the  application  process  (wrong  command  or  wrong  data) 
or  tile  translation  from  abstract  data  to  the  bit  representation  has  not  been 
done  correctly. 
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are  more  appropriately  handled  at  the  application  level.  However,  oar 
language  shouLd  provide  constructs  for  detection  of  duplicate  messages  at  the 
application  level. 

The  reliability  of  the  communication  subsystem  could  be  increased  through 
the  use  of  recoverable  queues.  That  is,  in  addition  to  dealing  with  the 
commun icat ion  errors  that  result  in  a loss  or  garbling  of  messages  sent  across 
a physical  commun icat ion  link,  the  communication  subsystem  can  guarantee  that 
messages  that  nave  been  presented  to  it  by  the  application  processes  and 
queued  for  delivery  (tnat  is,  messages  accepted  by  the  communication 
subsystem)  will  not  be  lost  if  the  node  fails.  This  degree  of  reliability  may 
be  important  if  translation  from  an  abstract  data  object  to  the  corresponding 
bit  representation  is  a costly  operation,  or  if  the  input  to  such  a 
translation  step  is  not  automatically  repeated  (e.g.  message  typed  by  a user). 
Tnis  argument  can  be  extended  to  the  requirement  that  the  communication 
suDsystem  should  guarantee  de  1 iver y of  all  messages  it  has  accepted  from  the 
application  processes.  That  means  that  in  addition  to  providing  recoverable 
queues  for  messages  that  have  not  been  sent  yet,  the  communication  subsystem 
must  continue  trying  to  send  the  queued_  messages  until  it  eventually 
succeeds.  At  the  receiving  node,  the  messages  have  to  be  stored  again  in 
recoverable  queues  until  they  are  picked  up  by  the  target  application  process. 
Unless  the  target  node  is  disabled  permanently,  it  is  possible  to  ensure  that 
each  request  either  will  be  processed  or  it  can  be  determined  that  the  request 
could  not  be  processed  and  why.  Of  course,  the  sending  application  process 
may  have  to  wait  for  a long  time  for  the  outcome  of  its  request,  so  it  may 
still  be  desirable  to  use  a timeout  on  the  application  level. 

Tne  reliability  mechanisms  do  represent  potentially  large  overhead,  and 
their  use  should  not  be  imposed  on  all  communications.  The  basic 


communication  scheme  should  be  simple,  fast,  and  inexpensive.  Rather  than 
insisting  on  the  guaranteed  delivery  scheme,  we  will  investigate  whether  it  is 
possible  to  vary  the  degree  of  reliability  provided  by  the  system  by  letting 
the  application  programmer  choose  from  several  different  protocols,  where  such 
protocols  would  be  implemented  as  extensions  (abstractions)  built  from  the 
basic  protocol. 


7.  PROTECTION  ISSUES 

In  the  class  of  distributed  systems  considered  in  our  project,  a likely 
case  is  that  a particular  node  is  utilized  by  one  user  or  at  most  by  a set  of 
cooperating  and  mutually  trusting  users.  In  this  case,  intra-node  protection 
mechanisms  are  not  required  to  have  power  sufficient  to  protect  against 
suoversion  and  malice.  This  is  in  strong  contrast  to  a system  such  as  Multics 
[SALT74I , and  many  otiier  time-snared  and  multiprogrammed  systems  that  were 
designed  to  operate  properly  with  a set  of  mutually  nostile  users.  What  is 
required  witnin  a single  node  is  a mechanism  that  protects  against  error  and 
forgetfulness.  Inter-node  protection,  on  the  other  hand  must  be  able  to  deal 
with  tne  potentially  hostile  environment:  1)  individual  nodes  are  autonomous, 
that  is,  it  is  not  possible  to  assume  that  they  will  behave  as  desired  by 
other  nodes,  and  2)  the  communication  lines  between  nodes  in  general  cannot  be 
pnysically  secured. 

We  assume  that  a capability  mechanism  will  be  the  basic  mechanism  used 
for  intra-node  protection.  By  capability  we  mean  an  unforgeable  identifier 
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for  an  object  that  tdentifi.es  the  type  of  the  object*  and  that  must  be 
presented  as  part  of  addressing  an  object.  By  constraining  a procedure  to 
execute  witn  a Limited  coliection  of  capabilities,  it  is  easy  to  guarantee 
that  the  procedure  wiii  not  do  arbitrary  damage  to  stored  information. 

In  tne  context  of  our  modeL,  the  efficiency  of  capabilities  in  comparison 
to  an  alternative  mechanism  such  as  access  control  lists  becomes  very 
important.  Since  we  assume  a world  witli  a large  number  of  small  objects,  it 
is  clearly  impossible  to  imagine  that  every  object  comes  complete  with  an 
access  control  list;  the  overhead  of  the  access  control  might  be  substantially 
larger  than  the  object  itself.  Capabilities,  on  the  other  hand,  need  be  no 
more  than  slightly  enlarged  addresses.  We  thus  propose  that  the  intra-node 
protection  mechanism  is  based  on  capabilities,  with  some  sort  of  capability 
cataloging  mechanism  playing  the  role  now  associated  with  the  traditional  file 
system . 

The  inter-node  protection  is  more  a matter  of  policy  than  of  mechanism. 

We  believe  that  protection  between  nodes  should  be  based  on  an  access  control 
List  mechanism  rather  than  a capability  mechanism.  This  claim  is  not  based  on 
difficulty  of  implementation;  either  mechanisn  can  be  imagined.  Rather,  it  is 
based  on  our  perception  of  the  high  level  needs  of  distributed  applications. 

A fundamental  way  to  characterize  the  difference  between  capabilities  and 
access  control  lists  is  that  capabilities  do  not  provide  any  easy  answer  to 


* "Capability"  if  often  used  to  mean  more  than  an  unforgeable  identifier:  a 
capability  may  llso  include  a specification  of  the  access  rights,  that  is,  a 
spec i f Lcat ion  of  which  of  the  operations  defined  for  the  type  of  the  object  in 
question  are  acutalLy  allowed  on  that  specific  object.  Alternatively,  access 
controL  could  also  be  achieved  by  making  the  object  appear  to  be  of  the  type 
that  imposes  the  desired  restrictions.  The  desirability  of  including  the 
access  rights  in  the  capability  and  the  feasibility  of  the  other  approach 
(especially  in  connection  with  providing  different  "views"  of  data  bases)  will 
be  investigated. 
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the  question  "Who  are  ulL  the  people  who  can  get  to  this  object?",  whiLe 
access  control  lists  make  it  very  difficult  to  ask  the  question  "What  are  all 
the  objects  that  l can  get  to?".  If  one  considers  real  world  protection 
problems,  including  those  drawn  from  domains  other  than  the  computer  domain, 
the  more  workable  model  of  protection  generally  turns  out  to  be  that  based  on 
access  control  lists.  While  capahillties  are  often  used  in  the  real  world, 
the  most  obvious  example  being  keys,  the  drawbacks  are  well  known.  Keys  are 
subject  to  unauthorized  duplication,  loss,  thef't,  etc.  More  relevantly, 
capabilities  (or  keys)  do  not  provide  a means  to  support  accountability. 

7.1  Protection  In  Tne  Abstract  Network 

The  intra-node  protection  problem,  while  less  severe  than  the  problem 
that  results  from  fully  suspicious  cooperation,  is  still  not  trivial.  The 
programmer  must  oe  provided  a means  of  partitioning  his  computations,  so  that 
certain  objects  are  accessible  only  in  certain  computations.  This  mechanism 
will  aLlow  him  to  debug  new  versions  of  software  without  running  the  risk  of 
destroying  existing  objects;  however,  such  a mechanism  is  also  desirable  after 
the  debugging  stage,  to  limit  the  effect  of  undetected  (unreported)  failures 
that  may  arise  from  residual  bugs  or  hardware  failures.* 

The  abstract  network  presented  in  Section  5 assumes  such  a partitioning: 
the  local  address  spaces  of  individual  guardians  define  protection  domains 
that  are  assumed  to  be  mutually  exclusive.  Thus,  the  guardian  (abstract  node) 
is  an  important  logical  unit  of  protection.  Within  a guardian,  it  is 
sufficient  to  provide  only  the  simplified  capabilities  discussed  earlier. 
Between  guardians  only  messages  can  be  exchanged;  there  is  no  physical  sharing 

* This  subject  is  also  included  in  the  discussion  of  reliability  issues. 
Indeed,  there  is  a strong  overlap  between  reliability  and  protection,  both  in 
their  definitions,  and  in  the  mechanisms  used. 
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of  data.  A.  guardian  can  scrutinize  all  incoming  messages,  and,  using  access 
control  Lists  and  authentication,  can  vaLidate  the  request.  However,  not 
every  guardian  need  use  this  form  of  protecton.  Just  as  the  implementation  of 
a procedure  can  make  use  of  procedures,  so  may  the  implementation  of  a 
guardian  be  defined  in  terms  of  other  guardians.  Thus,  some  guardians  are 
more  autonomous  than  others;  by  design,  some  guardians  will  receive  messages 
only  from  "friendLy"  guardians.  Therefore,  the  system  cannot  automatically 
enforce  protection  constraints  either  at  physical  node  boundaries  or  3t 
abstract  node  boundaries. 

An  important  consideration  in  maintaining  autonomy  of  the  individual 
nodes  in  a distributed  system  is  the  control  over  installation  of  the 
guardians.  vJe  are  assuming  that  guardians  are  always  created  at  the  same 
(pnysicaL)  node  as  the  creating  guardian  (process).*  Remote  guardians  can  be 
created  only  be  requesting  some  already  existing  guardian  at  the  remote  node 
to  do  the  creation.  This  approach  implies  that  each  node  must  come  into 
existence  with  (at  least)  one  guardian,  the  primal  guard ian . The  primal 
guardian  is  tailored  by  the  owner  of  the  node,  prior  to  adding  the  node  to  the 
network,  by  specifying  who  has  the  right  to  install  guardians  at  that  node  as 
part  of  a particular  application. 

7.2  Protection  Agents 

Let  us  look  now  at  tne  inter-node  protection  problem  from  a slightLy 
different  viewpoint.  It  will  be  a rare  case  where  a request  occurring  between 
nodes  consists  of  nothing  more  than  the  reading  or  writing  of  a single 
primitive  object.  In  most  cases,  we  can  expect  the  request  to  be  composed  of 
in  aggregate  of  reads  and  writes  on  various  objects,  which  the  requesting  node 

* The  same  is  true  for  objects. 
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views  as  atomic.  This  is  generally  referred  to  as  an  atomic  transaction.  The 
thing  that  must  be  protected  from  outside  is  the  right  to  execute  this  atomic 
transaction.  It  is  possible  that  the  isolated  reads  and  writes  required  as 
part  of  this  transaction  are  not  legitimate  for  the  outside  user  except  as  a 
part  of  this  or  some  other  transaction.  A typical  example  is  a situation 
where  the  user  is  allowed  to  see  some  statistics  but  not  the  individual  items 
in  the  actual  set  from  which  the  statistics  are  divided. 

To  express  such  higher  level  protection  constraints  concisely,  the 
Language  used  to  describe  the  inter-node  requrests  must  have  primitives  that 
will  it  easy  to  confirm  whether  or  not  a proposed  transaction  falls  within  the 
bounds  of  the  outstanding  protection  constraints.  Alternatively,  it  is 
possiole  to  visualize  inter-node  messages  that  contain  an  arbitrary  algorithm 
expressed  in  terms  of  a sequence  of  operations  on  a set  of  objects;  such  an 
algorithm  would  have  to  be  examined  and  confirmed  at  the  receiving  node  before 
execution  to  ensure  that  it  conforms  to  the  higher  level  security  constraints. 
However,  the  construction  of  a verification  algorithm  that  ensures  that  an 
arbitrary  program  conforms  to  one  of  the  number  of  nigh  level  protection 
constraints  wouLd  be  a challenge  to  the  most  optimistic  of  the  program 
verification  researchers;  thus  this  approach  must  be  rejected. 

We  postulate  the  idea,  common  in  data  management  systems,  that  different 
users  of  a data  base  have  different  views  of  the  data  base,  often  called 
different  data  models.  From  the  outside,  the  data  model  appears  to  describe 
physically  stored  information  and  the  acceptable  operations  on  it.  However, 
internally,  the  data  modeL  may  have  little  correspondence  to  the  information 
actually  stored.  Rattier,  it  may  be  realized  as  algorithms  that  derive  the 
modeled  data  from  the  shared  information.  Thus,  we  see  users  first  being 
divided  into  large  groups,  based  on  which  data  model  they  use,  and  then  within 
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Chose  groups  being  further  divided  according  to  which  operations  they  can 
perform  on  the  data  model  provided.  For  example,  some  users  may  be  able  to 
read  certain  records,  others  to  read  and  write  them.  Each  data  model  implies 
the  existence  of  an  algorithm  to  translate  between  that  data  model  and  the 
stored  information.  It  is  these  algorithms  that  must  be  provided  in  advance, 
one  set  for  eacn  data  model.  Tne  programming  system  must  provide  facilities 
for  creating  such  data  models,  mapping  them  into  the  actual  stored 
information,  and  synchronizing  read  and  write  operations  originating  from 
different  data  models. 

Higher  level  protection  constraints  can  be  expressed  through  the  use  of 
abstract  types.*  Users  of  an  abstract  object  are  permitted  to  manipulate  the 
object  only  through  the  operations  defined  by  the  object's  type;  in  reality, 
an  arbitrary  computation  may  be  performed  on  a possibly  large  number  of 
objects  that  constitute  the  representation  of  the  given  abstract  object.  In 
addition,  individual  users  may  be  restricted  to  only  a subset  of  the 
operations  defined  for  an  abstract  type.  However,  the  concept  of  the  data 
model  calls  for  yet  something  more.  The  traditional  view  of  data  models 
permits  a low-level  information  entity  to  be  shared  by  different  users  through 
a variety  of  data  models.  To  support  this  view  via  abstract  types,  it  must  be 
possible  to  manipulate  a single  low-level  object  as  part  of  a number  of 
different  abstract  data  objects,  depending  on  the  rights  of  the  different 
users.**  Tne  idea  of  data  models  is  that  different  users  have  different  views 


* Several  research  projects  on  the  use  of  abstract  types  In  implementation  of 
data  base  management  systems  are  under  way  [SMIT77,  SCHM77]. 

**  This  can  be  a very  difficult  problem.  In  the  context  of  relational  data 
bases,  for  anything  but  very  simple  views,  "propagation  of  updates  from  views 
to  base  relations  becomes  complicated,  dangerous,  and  sometimes  impossible" 

[ CHAM 75 1 . 
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of  Che  world,  but  fundamentally,  they  do  turn  out  to  be  views  of  the  same 


world.  Thus,  In  some  sense,  they  must  ultimately  rest  on  the  same  physical 
data. 

The  inter-node  protection  can  be  enforced  as  follows.  Any  outside  user 
(process)  perceives  the  information  in  a particular  (abstract)  node  as  a 
number  of  objects  that  he  can  manipulate  independently,  and  a set  of 
permissible  operations  on  those  objects.  These  externally  visible  objects  are 
arranged  in  such  a way  that  there  are  no  explicit  protection  constraints  that 
tie  one  object  to  another.  A message  arriving  at  a node  to  manipulate  one  of 
these  objects  must  be  processed  by  some  active  entity  (e.g.  a process)  that 
confirms  that  the  requestor  of  the  action  has  the  right  to  perform  it,  and 
then  implements  the  operation  at  the  node  by  invoking  operations  on  other 
objects,  not  directly  accessible  from  outside.  We  will  refer  to  this  entity 
as  a protection  agent.  The  protection  agent  could  be  the  same  entity  as  the 
guardian.  The  guardian,  however,  presents  only  a single  view  of  the  guarded 
objects.  The  protection  requirements  described  above  lead  to  a model  outlined 
in  Figure  4.  In  this  model,  the  high  level  protection  function  is  separated 
from  the  low  level  protection  and  synchronization  function  of  the  guardian. 

Tne  protection  agents  represent  different  views  of  the  guarded  objects.  The 
guardian  controls  the  actual  physical  access  to  the  guarded  objects;  it 
imposes  synchronization  constraints  on  requests  passed  to  it  from  the 
protection  agents. 

While  there  are  an  infinite  number  of  protection  checks  that  the 
protection  agent  may  wish  to  perform  in  a particular  case,  there  are  three 
checks  that  can  normatly  be  expected  to  occur.  First,  the  protection  agent 
will  wish  to  confirm  that  the  originator  of  the  message  has  the  right  to 
invoke  the  protection  agent  at  all.  Second,  the  protection  agent  may  wish  to 
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confirm  tnat  the  particular  object  or  objects  invoLved  in  the  operation 
requested  are  indeed  accessible  to  the  requestor,  and  third,  that  the 
particular  operation  to  be  performed  on  the  object  is  permitted  for  the 
requestor.  For  example,  a data  base  manager  may  wisn  to  confirm  that  the 
requestor  can  invoke  it,  that  the  particular  record  being  manipulated  is 
accessible  to  the  requestor,  and  whether  or  not  updates  or  just  reads  are 
permissible  for  that  requestor.  In  any  particular  case,  one  or  more  of  these 
steps  may  be  omitted.  Thus,  for  example,  provided  that  a user  has  a right  to 
invoke  a protection  agent  at  all,  he  may  have  the  right  to  manipulate  any 
object  normally  made  accessible  through  that  agent.  He  may  also  be  permitted 
to  perform  my  of  tne  oper.itions  defined  on  the  objects.  In  this  case,  only 
the  first  of  the  three  tests  need  be  performed. 

We  should  now  pause  and  consider  how  this  representation  of  protection 
meshes  with  tne  conclusion  drawn  earlier  that  inter-node  protection  should  be 
expressed  in  terms  of  iccess  control  lists.  Clearly,  the  use  of  access 
control  lists  implies  that  the  protection  agent  must  be  able  to  determine 
reliaoly  the  originator  of  every  message.  Using  the  terminology  developed  for 
characterization  of  protection  mechanisms  in  a centralized  system,  we  will 
assune  that  every  message,  3t  its  origin,  has  associated  with  it  a principal 
identifier,  wnich  identifies  the  entity  to  be  held  accountable  for  the  request 
in  the  message.  Some  technique  such  as  encryption  will  be  used  to  ensure  the 
be l ievnD  i l ity  of  the  principal  id  by  the  recipient  of  the  message,  if  the 
message  has  originated  from  a different  physical  node  than  the  recipient 
[kKNl/b,  MEED78] . Using  the  principal  id,  the  first  protection  check 
descrioed  above  is  easy  to  implement.  We  can  associate  with  every  protection 
$ agent  an  access  control  list,  and  insist  that  the  principal  identifier 

associated  with  the  message  be  on  that  list  before  the  protection  agent  can  be 
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invoked  at  aLL.  The  second  test,  that  of  ensuring  that  this  principal  is 
allowed  to  manipulate  the  particular  objects  in  question,  can  be  handled  in  a 
variety  of  ways.  One  obvious  technique  is  to  associate  with  each  entry  In  the 
access  control  list,  a list  of  ail  the  objects  that  the  particular  principal 
is  allowed  to  use.  The  protection  agent  can  then  refer  to  the  list  to 
determine  tne  access  privileges  of  the  requestor.  If  the  third  type  of 
protection  check  is  required,  it  can  be  implemented  as  part  of  this  same  list, 
by  associating  witn  the  entry  for  each  object  a notation  describing  the 
operations  that  this  principal  is  permitted  to  perform  on  that  object. 


8.  LANGUAGE  CONSTRUCTS 

The  desirability  of  various  kinds  of  language  constructs  has  been  pointed 
out  in  several  places  in  the  proceeding  discussions.  We  see  the  language 
design  to  be  the  main  vehicle  of  our  project.  The  language  has  to  provide  an 
effective  interface  for  the  application  programmer,  while  at  the  same  time  it 
is  necessary  to  take  into  consideration  all  the  problems  arising  from  the 
distributed  processing  environment.  This  section  presents  some  examples  of 
the  planned  language  constructs. 

Guardians  will  be  a major  type  of  module  available  for  program 
construction.  A guardian  is  a program  whose  purpose  is  to  protect  a resource. 
For  example,  a guardian  might  guard  an  entire  data  base,  a partition  of  that 
data  base,  a physical  resource,  e.g.  an  I/O  device,  or  an  abstraction  of  such 
a device.  The  guardian  may  be  thought  of  as  a local  collection  of  processes 


and  data;  the  processes  within  the  same  guardian  may  share  the  data,  but  they 
communicate  with  processes  in  other  guardians  only  by  message  passing.  As  was 


user  of  a guardian  is  aware  of  its  behavior:  the  kinds  of  messages  that  can 
be  addressed  to  (processes  in)  it,  and  the  kinds  of  repsonses  it  makes  to 
these  messages.  The  fact  that  the  guardian  is  implemented  in  terms  of  Lower  ‘ 


Level  guardians  is  invisible  to  the  user. 

An  important  issue  in  designing  a Language  for  a distributed  system  is 
the  primitives  for  sending  and  receiving  messages.  The  basic  scenario  in  the 
abstract  network  is  one  process  sending  a message  to  another  process 
requesting  some  action;  Later  there  should  be  another  message,  f Lowing  in  the 
otner  direction,  indicating  the  result  of  the  action.  Lt  must  be  possible  to 
express  in  the  Language  that  a particular  message  is  a repLy  to  another 
message.  In  addition,  it  is  necessary  to  address  the  problem  that  the  reply 
may  never  arrive,  or  that  the  request  message  cannot  be  sent.  Possible 
ipproaches  differ  in  how  Long  (for  what  event)  the  sending  process  must  wait 
oefore  it  can  proceed.  CloseLy  reLated  to  this  degree  of  waiting  is  how  the 
Language  will  handle  failures. 

8.1  lue  Send  Command  That  Waits  For  A RepLy 

in  tiiis  ipproach,  the  sending  process  is  forced  to  wait  until  the 
response  comes  oack  from  the  receiver,  or  some  timeout  or  failure  resuLts.  A 
possibLe  syntax  might  be: 

Send  C(args)  £o  A 1 imeout  Time: 

Rl ( formaLs)  Oo  SI ; 

R2(  formals)  S2; 

failure  (formals)  D£  Sfailure; 

timeout  l)o  St  imeout; 

End ; 
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Here  A is  a process  and  C(args)  is  the  message,  consisting  of  a command,  C, 
and  some  arguments.  The  remainder  of  the  construct  lists  the  various  possible 
responses,  together  with  the  appropriate  action  to  be  taken  by  the  sending 
process.  R1 , R2,  etc.,  are  responses  for  A;  some  might  be  normal,  and  some 
abnormal.  "Failure"  covers  various  failures  chat  are  detected  either  by  the 
system  or  by  the  receiving  process  A.  Tne  arguments  of  "faiLure"  specify  the 
type  of  failure.  Some  examples  of  a detected  failure  are: 
a)  the  message  as  specified  cannot  be  constructed, 
d)  tne  specified  process  (A)  no  longer  exists, 

c)  tne  target  node  is  inaccessible, 

d)  congestion  (tne  target  node  of  the  target  process  (A)  does  not  have 
enough  buffer  space), 

e)  the  message  cannot  oe  decoded  (the  abstract  objects  contained  in  a 
message  cannot  be  reconstructed). 

Which  of  tnese  failures  are  visible  at  the  application  level  depends  on  the 
design  of  the  system  level.  As  discussed  in  Section  6,  the  system  level  might 
be  designed  in  such  a way  that  message  delivery  is  guaranteed.  This  would 
eliminate  tne  need  to  cope  witii  the  failures  of  the  type  c and  d at  the 
application  Level. 

The  timeout  action  is  taken  if  "time"  is  exceeded.  If  the  system  is 
designed  for  guaranteed  delivery,  the  timeout  action  that  terminates  the  send 
command  should  reLease  the  buffers  in  wtiich  the  system  keeps  the  message  for 
deLivery  to  the  target  application  process.  It  should  be  understood  that  this 
timeout  is  for  the  pair  of  messages  to  be  exchanged  between  processes  in  the 


abstract  network;  a different  timeout  vaLue  is  used  in  the  underlying  system 


to  govern  retransmission  of  packets.* 

A different  kind  of  "send"  command  is  needed  in  the  receiving  process, 
since  the  receiving  process  must  be  abLe  to  respond  to  the  command  without 
waiting  for  tne  original  sender  process  to  respond  back.  To  receive  messages, 
A might  use  a construct: 

Command  case 

C(formaLs)  do  ...  re  pi v R(args)  ; . . . ; 

End ; 

ilere,  A is  waiting  for  one  of  a number  of  messages;  if  several  are  avaiLabLe, 
one  is  selected  in  i fair  way.  The  message  is  then  decoded,  the  contained 
data  issigned  to  tne  formils,  ind  the  statements  associated  with  the  selected 
message  ire  exe  aid.  Tne  move  form  of  Reply  Command  sends  i message  back  to 
the  process  that  seat  tae  message;  tne  identity  of  the  process  to  reply  to  is 
Hit  i.natically  cxtriteJ  from  the  received  message.  Another  form  of  reply: 

Reply  R(urgs)  Id  B 

that  explicitly  names  tne  process  to  reply  to  will  probably  also  be  needed; 
tnis  would  permit  a third  process  to  be  the  replier  to  the  original  sender. 

Tne  approach  sketched  move  has  tne  obvious  advantage  of  pairing  sends 
»nd  receives.  It  ilso  lias  some  obvious  disadvantages.  For  one  thing,  there 

* Tne  "timeout"  parameter  presents  a rather  difficult  programming  problem  — 
it  requires  the  programmer  to  be  aware  of  the  physical  timing  of  the 
communication  subsystem  and  the  time  needed  by  the  recipient  to  process  the 
request.  However,  unless  the  guaranteed  delivery  approach  is  adopted,  this 
timeout  is  essentiil  to  cope  with  failures.  Although,  as  argued  in  Section  6, 
some  form  of  timeout  is  also  desirable  in  the  guaranteed  delivery  scenario, 
possibly  a more  intelligent  way  of  specifying  a timeout  condition  could  be 
dev  i s e d • 
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are  two  send  commands.  More  Important,  however,  is  the  Loss  of  parallelism, 
if  the  sending  process  has  other  tasks  to  do  while  its  request  is  being 
processed,  it  must  either  not  do  them,  thus  reducing  efficiency,  or  it  must 
spawn  another  process  to  do  these  tasks.  Thus  a language  supporting  this 
approach  must  provide  rich  facilities  for  parallelism.* 

S.2  Separation  Of  Send  Command  And  Reply  Processing 

Tne  second  possible  approach  is  the  opposite  of  the  waiting  approach: 
the  sending  process  does  not  wait  at  all  but  continues  running,  performing 
actions  on  local  objects,  or  possibly  sending  more  messages.  Only  when  it 
needs  a response  that  is  not  ready,  it  will  have  to  wait.  The  language  now 
has  to  provide  additional  constructs  that  allow  the  programmer  to  distinguish 
wnich  response  goes  with  which  request  and  to  specify  that  the  process  wishes 
to  wait  for  the  reply  to  a specific  message,  rather  than  a reply  to  any 
message  that  may  have  been  previously  sent. 

There  are  various  ways  in  which  these  problems  might  be  solved.  For 
example,  send  commands  might  be  LabeLled: 

SI:  Send  CL(args)  fo  Al; 

S2:  Send  C2(args)  £o  A2; 

Get  SI  Rep  1 ies : . . . ; 

in  this  approach,  eacli  Send  Command  has  a continuation  (as  in  Actors  [HEW176]) 
that  can  be  named  to  identify  the  responses  of  interest  in  Rep l ies . Following 
Replies  Would  be  the  List  of  alternative  responses,  as  shown  in  the  preceding 
section,  to  the  message  sent  by  statement  SI.  Note  that  the  errors  arising  in 


* Note:  this  is  not  the  only  reason  for  which  such  facilities  for  parallelism 
might  be  needed.  See  the  discussions  of  guardians. 
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Cuming  Ci(args)  into  a message  wouid  be  exceptions  (abnormal  terminations)  of 
tne  Send  Command;  failures  such  as  (c),  (d)  and  (e)  described  earlier  would 
have  to  be  reported  outside  of  the  Send  Command  (in  Replies) . 

Another  possible  approach  is  to  use  ports: 

Send  C(args)  £o  A Rep  1 y-to  P 

where  P is  a port  that  can  be  named  by  more  than  one  Send  Command.  Ports 
offer  flexibility  in  expressing  different  patterns  of  requests  and  replies, 
both  between  a single  pair  of  processes  and  in  cases  wnere  a process 
communicates  with  several  other  processes. 

Tnis  approach  permits  parallelism  and  is  more  flexible,  especially  in 
connection  with  ports.  However,  the  linguistic  mechanisms  needed  to  enable 
the  programmer  to  do  the  matching  introduce  extra  complexity;  how  much 
flexibility  is  gained  and  how  much  complexity  is  added  requires  further  study. 
Note  that  this  approach  does  not  eliminate  the  need  for  supporting  timeout, 
but  now  the  timeout  is  specified  at  the  point  where  the  process  must  wait  for 
the  reply. 

8.3  The  In-Between  Approach 

The  third  approach  is  to  make  the  sender  wait  for  some  indication  about 
the  progress  in  the  delivery  of  the  message  instead  of  waiting  for  the  reply 
from  tne  target  process.  For  example,  in  Hoare's  language  [HOAR77),  the 
sender  waits  until  the  replier  receives  the  message. 

The  first  question  to  ask  is:  does  this  approach  offer  the  programmer 
any  advantages  over  the  other  two  approaches?  Since  sends  and  replies  are  not 
explicitly  paired,  from  this  point  of  view  in  the  in-between  approach  offers 
similar  advantages  and  disadvantages  as  the  no-wait  approach  described  in 
Section  8.2.  What  is  gained  over  the  no-wait  approach  is  that  certain 
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failures,  for  example  (c)  and  (d) , or  possibly  even  (e)  can  be  treated  as 
exceptions  of  tne  send  command.  More  importantly,  tbe  completion  of  the  send 
command  indicates  that  a meaningful  message  (to  some  extent)  has  been 
received.  However,  the  completion  of  the  send  does  not  guarantee  that  the 
message  will  be  processed  (e.g.  the  receiver's  node  might  fail  immediately 
after  the  message  is  received).  It  should  be  noted  that  there  is  a 
substantial  loss  of  parallelism  over  the  no-wait  approach. 

The  in-between  approach  is  often  advocated  on  implementation  grounds,  as 
a means  to  prevent  flooding  of  the  receiver.  Flooding  means  that  messages  are 
delivered  faster  than  the  receiver  can  process  them.  Since  the  buffer  space 
of  the  receiver  is  always  limited,  either  some  control  must  be  provided  to 
stop  the  flow  of  messages  or  some  messages  must  be  discarded.  In  a system 
with  shared  memory,  a very  efficient  implementation  is  possible,  namely,  each 
process  has  one  send  buffer,  and  the  message  is  held  there  until  the  receiver 
wants  it.  In  a system  without  shared  memory  such  as  our  distributed  system, 
this  approach  is  clearly  impractical,  since  extra  messages  would  be  needed  to 
inform  the  communicating  parties  that  a message  is  ready  (sender  to  receiver) 
and  that  it  can  be  transferred  (receiver  to  sender). 

To  conclude,  the  in-between  approach  does  not  seem  to  offer  any  real 
advantages.  The  waiting  approach,  while  simpler,  is  more  limiting  from  the 
point  of  concurrency  and  flexibility;  tnus  we  believe  that  we  should  provide 
the  type  of  constructs  outlined  in  Section  3.2. 


9.  SUMMARY  AMD  FUTURE  WORK 

fnis  report  presented  a basic  conceptual  model  of  distributed  computation 


in  an  environment  wnere  the  individual  nodes  of  the  underlying  network  are 


autonomous  and  may  be  spread  over  a Large  geographicaL  area.  Several 
important  Language  and  system  issues  were  discussed,  including  reLiabiLity, 
protection,  and  tne  Language  constructs  for  sending  and  receiving  messages  in 
tne  abstract  network.  We  tried  to  justify  our  decisions  regarding  each 
individual  topic;  of  course,  these  decisions  may  be  modified  as  we  gain  more 
experience  with  the  use  of  distributed  systems. 

Ic  should  be  realized  that  many  of  the  issues  that  are  being  attacked  in 
our  project  go  beyond  the  area  of  distributed  systems.  The  integration  of  a 
programming  language  ind  an  operating  system  is  one  of  these  issues.  The 
problem  of  such  a "total  environment"  has  not  yet  been  solved  satisfactorily 
even  for  a single  processor  system;  in  the  case  of  distributed  systems,  the 
design  task  is  even  more  difficult.  However,  while  for  a centralized  system 
tne  total  environment  is  a desirable  feature,  we  believe  that  for  a 
distributed  system  it  is  an  essential  property. 

Tne  worn  continues  on  several  LeveLs.  To  get  a better  understanding  of 
wnat  it  means  to  do  distributed  computing,  we  must  Implement  some  distributed 
applications.  This  means  that  we  need  a language  that  will  facilitate  such  a 
task.  Tne  CLU  language  [LISK77I  is  oeing  extended  to  incorporate  the  guardian 
construct  and  corresponding  message  passing  primitives.  Extended  CLU  will 
serve  as  an  experimental  base  for  both  the  language  work  and  the  design  of  the 
reliable  system  support.  Through  this  experiment,  we  hope  to  determine  the 
effectiveness  of  different  types  of  language  constructs  and  their 
implementab i L ity  on  conventional  hardware.  This  experience  will  be  used  to 
improve  the  Language,  and  possibly  to  guide  a design  of  a more  hospitable 
machine  architecture. 

On  tne  system  level,  research  concentr  ites  on  high  level  protocols  and 
mechanisms  needed  to  ichleve  reliable  operation.  The  feasibility  of  using 
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some  of  the  recentLy  developed  approaches  to  providing  atomic  operations 
[REEIX73;  >tONT7d;  TAKA731  in  real  systems  is  being  investigated.  Other  issues 
include  naming  and  copying  objects  in  the  abstract  network  [SOEL79I,  the 
guaranteed  delivery  support,  and  the  use  of  backup  copies. 

Ultimately,  the  success  of  our  programming  system  will  depend  on  its 
usefulness  in  constructing  distributed  applications.  in  addition  to  a design 
of  distributed  implementation  of  several  small  scale  sample  problems,  we 
envision  our  system  to  be  used  in  implementation  of  a substantial  distributed 
application;  one  of  the  primary  candidates  in  an  "office-automation"  of  our 
laooratory.  This  last  project  is  dependent  on  two  other  projects  within  LCS : 
development  of  a local  area  network  and  development  of  advanced  nodes  for  a 
distributed  system.  These  two  projects  will  provide  the  hardware  base 
necessary  for  experimentation  with  distributed  processing. 
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